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Context. The fast classification of new variable stars is an important step in making them available for further research. Selection of 
science targets from large databases is much more efficient if they have been classified first. Defining the classes in terms of physical 
parameters is also important to get an unbiased statistical view on the variability mechanisms and the borders of instability strips. 
Aims. Our goal is twofold: provide an overview of the stellar variability classes that are presently known, in terms of some relevant 
5_j . stellar parameters; use the class descriptions obtained as the basis for an automated 'supervised classification' of large databases. Such 

automated classification will compare and assign new objects to a set of pre-defined variability training classes. 
Methods. For every variability class, a literature search was performed to find as many well-known member stars as possible, or a 
considerable subset if too many were present. Next, we searched on-line and private databases for their light curves in the visible band 
and performed period analysis and harmonic fitting. The derived light curve parameters are used to describe the classes and define the 
training classifiers. 

Results. We compared the performance of different classifiers in terms of percentage of correct identification, of confusion among 
classes and of computation time. We describe how well the classes can be separated using the proposed set of parameters and how 
future improvements can be made, based on new large databases such as the light curves to be assembled by the CoRoT and Kepler 
r**"* ■ space missions. 

Conclusions. The derived classifiers' performances are so good in terms of success rate and computational speed that we will evaluate 
them in practice from the application of our methodology to a large subset of variable stars in the OGLE database and from comparison 
of the results with published OGLE variable star classifications based on human intervention. These results will be published in a 
subsequent paper. 
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1. Introduction solved light curves for up to 60000 stars with a sampling rate 
J3 . better than 10 minutes during 5 months. Even higher numbers of 
The current rapid progress in astronomical instrumentation pro- stars (> 100000) will be measured for similar purposes and with 
vides us with a torrent of new data. For example, the large scale comparable sampling rate by NASA's Kepler mission (launch 
photometric monitoring of stars with ground-based automated end 2 008, duration 4 years). The ESA Gaia mission (launch fore- 
telescopes and space telescopes delivers us large numbers of seen in 2 oil) will map our Galaxy in three dimensions. About 
high quality light curves. The HIPPARCOS space mission is an one billion stars will be mon i t ored for this purpose, with about 
example of this and led to a large number of new variable stars 80 measurements over 5 years for each star, 
discovered in the huge set of light curves. In the near future, new 

space missions will deliver even larger numbers of light curves Among these large samples, many new variable stars of 

of much higher quality (in terms of sampling and photometric known and unknown type will be present. Extracting them, and 

precision). The CoRoT mission (Convection Rotation and plan- making their characteristics and data available to the scientific 

etary Transits, launched on 27 December 2006) has two main community within a reasonable timescale, will make these cata- 

scientific goals: asteroseismology and the search for exoplanets lo S ues reall y useful - II is clear that automated methods have to be 

using the transit method. The latter purpose requires the photo- used here - Minin g techniques for large databases are more and 

metric monitoring of a large number of stars with high precision. more frequently used in astronomy. Although we are far from 

As a consequence, this mission will produce excellent time re- reproducing capabilities of the human brain, a lot of work can 

be done efficiently using intelligent computer codes. 



* The documented classification software codes as well as the light In this P a P er , we present automated supervised classification 

curves and the set of classification parameters for the definition stars, methods for variable stars. Special attention is paid to compu- 

are available electronically from the A&A anonymous ftp site. tational speed and robustness, with the intention to apply the 

** Figures 5 to 12 are only available in the electronic form of the paper, methods to the huge datasets expected to come from the CoRoT, 
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Kepler and Gaia satellite missions. We tackle this problem with 
two parallel strategies. In the first, we construct a Gaussian mix- 
ture model. Here, the main goals are to optimize speed, simplic- 
ity and interpretability of the model rather than optimizing the 
classifiers' performance. In the second approach, a battery of 
state-of-the-art pattern recognition techniques is applied to the 
same training set in order to select the best performing algorithm 
by minimizing the misclassification rate. The latter methods are 
more sophisticated and will be discussed in more detail in a sub- 
sequent paper (Sarro et al., in preparation). 

For a supervised classification scheme, we need to prede- 
fine the classes. Every new object in a database to be classified 
will then be assigned to one of those classes (called definition 
or training classes) with a certain probability. The construction 
of the definition classes for stellar variability is, therefore, an 
important part of this paper. Not only are these classes neces- 
sary for this type of classification method, they also provide 
us with physical parameters describing the different variability 
types. They allow us to attain a good view on the separation and 
overlap of the classes in parameter space. For every variability 
class, we derive descriptive parameters using the light curves of 
their known member stars. We use exclusively light curve in- 
formation for the basic methodology we present here, because 
additional information is not always available and we want to 
see how well the classes can be described (and separated) using 
only this minimal amount of information. This way, the method 
is broadly applicable. It is easy to adapt the methods when more 
information such as colors, radial velocities, etc. is available. 

The first part of this paper is devoted to the description of 
the stellar variability classes and the parameter derivation. The 
classes are visualized in parameter space. In the second part, a 
supervised classifier based on multivariate statistics is presented 
in detail. We also summarize the results of a detailed statistical 
study on Machine Learning methods such as Bayesian Neural 
Networks. Our variability classes are used to train the classi- 
fiers and the performance is discussed. In a subsequent paper, the 
methods will be applied to a large selection of OGLE (Optical 
Gravitational Lensing Experiment) light curves, while we plan 
to update the training classes from the CoRoT exoplanet light 
curves in the coming two years. 



2. Description of stellar variability classes from 
photometric time series 

We provide an astrophysical description of the stellar variabil- 
ity classes by means of a fixed set of parameters. These pa- 
rameters are derived using the light curves of known mem- 
ber stars. An extensive literature search provided us with 
the object identifiers of well-known class members. We re- 
trieved their available light curves from different sources. The 
main sources are the HIPPARCOS space data (lESAlll997t 
iPerrvman &ES A 1997) an d the Geneva and OGLE ground- 
based data~|u dalski e tall Il999at IWvrzvkowski et all l2004t 
ISoszvnski et al.ll2002l) . Other sources include ULTRACAM data 
(ULTR A-fast, triple-beam CCD CAMera), see lDhillon & Marshl 
( |2001|) . MOST data (Microvariability and Oscillations of 
STars, see http://www.astw.ubc.ca/MOST/), WET data (Whole 
Earth Telescope, see ht tp://wet.physics. iastate. edu/), ROTOR 
data (JGrankin et al.ll2.Q07b . Lapoune/CFHT data (Fontaine, pri- 
vate communication), and ESO-LTPV data (European Southern 
Observatory Long-T erm Photometry of Variables project), see 
ISterken et al.l d 1995b . TableQ] lists the number of light curves 
used from each instrument, together with their average total time 



Table 1. The sources and numbers of light curves NLC used to 
define the classes, their average total time span < T,„, > and then- 
average number of measurements < N po i„ ts > 



Instrument 


NLC 


< Tun > (days) 


"^ ™ points ' > 


HIPPARCOS 


1044 


1097 


103 


GENEVA 


118 


3809 


175 


OGLE 


527 


1067 


329 


ULTRACAM 


19 


22 


15820 


MOST 


3 


34 


59170 


WET 


3 


5.6 


11643 


ROTOR 


3 


5066 


881 


CFHT 


3 


0.18 


1520 


ESO-LTPV 


20 


2198 


209 



span and their average number of measurements. For every con- 
sidered class (see Table©, we have tried to find the best avail- 
able light curves, allowing recovery of the class' typical vari- 
ability. Moreover, in order to be consistent in our description of 
the classes, we tried, as much as possible, to use light curves 
in the visible band (V-mag). This was not possible for all the 
classes however, due to a lack of light curves in the V-band, or 
an inadequate temporal sampling of the available V-band light 
curves. The temporal sampling (total time span and size of the 
time steps) is of primordial importance when seeking a reli- 
able description of the variability present in the light curves. 
While HIPPARCOS light curves, for example, are adequate in 
describing the long term variability of Mira stars, they do not 
allow recovery of the rapid photometric variations seen in some 
classes such as rapidly oscillating Ap stars. We used WET or 
ULTRACAM data in this case, dedicated to this type of object. 
For the double-mode Cepheids, the RR-Lyrae stars of type RRd 
and the eclipsing binary classes, we used OGLE light curves, 
since they both have an adequate total time span and a better 
sampling than the HIPPARCOS light curves. 

For every definition class, mean parameter values and vari- 
ances are calculated. Every variability class thus corresponds to 
a region in a multi-dimensional parameter space. We investigate 
how well the classes are separated with our description and point 
out where additional information is needed to make a clear dis- 
tinction. Classes showing a large overlap will have a high prob- 
ability of resulting in misclassifications when using them in the 
training set. 

The classes considered are listed in Table|2] together with 
the code we assigned to them and the number of light curves we 
used to define the class. We use this coding further in this paper, 
and in the reference list, to indicate which reference relates to 
which variability type. For completeness, we also list the ranges 
for r e ff, logg, and the range for the dominant frequencies and 
their amplitudes present in the light curves. The first two phys- 
ical parameters cannot be measured directly, but are calculated 
from modeling. We do not use these parameters for classifica- 
tion purposes here because they are in general not available for 
newly measured stars. Also, for some classes, such as those with 
non-periodic variability or outbursts, it is not possible to define 
a reliable range for these parameters. The ranges for the light 
curve parameters result from our analysis, as described in Sect. 
2.1. 

We stress that the classes considered in Table|2]constitute the 
vast majority of known stellar variability classes, but certainly 
not all of them. In particular, we considered only those classes 
whose members show clear and well-understood visual photo- 
metric variability. Several additional classes exist which were 
defined dominantly on the basis of spectroscopic diagnostics or 
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Example light curves for all the classes 

Light curve analysis (2.1) 
Light curve parameters (2.1) 



Transformed light curve parameters: classification attributes (2.1) 



Classification results (4.1) 
Performance evaluation 



Feature selection (3.2) 

,» I 

Gaussian Mixture Classifier (3.1) Machine Learning Classifiers (3.2) 



Classification results (4.2) 
Performance evaluation 



the required resolution to allow a good fitting, and computation 
time. 

We searched for up to a maximum of three independent fre- 
quencies for every star. The procedure was as follows: the Lomb- 
Scargle periodogram was calculated and the highest peak was 
selected. The corresponding frequency value f\ was then used to 
calculate a harmonic fit to the light curve of the form: 



y(t) = ^(aj sin 2nf\jt + bj cos Infijt) + b , 

7=1 



(1) 



with y(t) the magnitude as a function of time. Next, this curve 
was subtracted from the time series (prewhitening) and a new 
Lomb-Scargle periodogram was computed. The same procedure 
was repeated until three frequencies were found. Finally, the 
three frequencies were used to make a harmonic best-fit to the 
original (trend subtracted) time series: 



Fig. 1. Schematic overview of the different steps (sections indi- 
cated) in the development and comparison of the classification 
methods presented in this paper. 



photometry at wavelengths outside the visible range. For some 
classes, we were unable to find good consistent light curves. 
Examples of omitted classes are hydrogen-deficient carbon stars, 
extreme helium stars, y or X-ray bursts, pulsars, etc. Given that 
we do not use diagnostics besides light curves at or around vis- 
ible wavelengths in our methods presently, these classes are not 
considered here. In the following we describe our methods in 
detail. A summary of the different steps is shown in Fig.Q] 

2. 1 . Light curve analysis and parameter selection 

After removal of bad quality measurements, the photometric 
time series of the definition stars were subjected to analysis. 
First, we checked for possible linear trends of the form a + bT, 
with a the intercept, b the slope and T the time. These were 
subtracted, as they can have a large influence on the frequency 
spectrum. The larger the trend is for pulsating stars, the more the 
frequency values we find can deviate from the stars' real pulsa- 
tion frequencies. 

Subsequently, we performed a Fourier analysis to find pe- 
riodicities in th e light curves. We used the well-known Lomb- 
Scargle method (Lomb 1976HScarg le 1982). The computer code 
to calculate the periodograms was based on an algorithm writ- 
ten by J. Cuyper s. It followed outlines given by Ponman (1981) 
and lKurtad 1985b focussed on speedy calculations. As is the case 
with all frequency determination methods, we needed to specify 
a search range for frequencies (/o, fy and A/). Since we were 
dealing with data coming from different instruments, it was in- 
appropriate to use the same search range for all the light curves. 
We adapted it to each light curve's sampling, and took the start- 
ing frequency as /o = l/T, ot , with T t0 , the total time span of the 
observations. A frequency step A/ = 0.1/T fDf was taken. For the 
highest frequency, we used the average of the inverse time inter- 
vals between the measurements: fy = 0.5 < I /AT > as a pseudo 
Nyquist frequency. Note that fy is equal to the Nyquist fre- 
quency in the case of equidistant sampling. F or particular cases, 
an ev en higher upper limit can be used (see lEver & Bartholdil 
1999). Our upper limit should be seen as a compromise between 



y(t) = ^ 2j(«y sin Infijt + b u cos Infijt) + b . 

1=1 ;=1 



(2) 



The parameter bo is the mean magnitude value of the light curve. 
The frequency values /; and the Fourier coefficients ay and b,j 
provide us with an overall good description of light curves, if the 
latter are periodic and do not show large outbursts. It is important 
to note, in the context of classification, that the set of Fourier co- 
efficients obtained here is not unique: identical light curves can 
have different coefficients, just because the zero-point of their 
measurements is different. The Fourier coefficients are thus not 
invariant under time-translation of the light curve. Since we want 
to classify light curves, this is inconvenient. We ideally want all 
light curves, identical apart from a time-translation, to have the 
same set of parameters (called attributes when used for classi- 
fying). On the other hand, we want different parameter sets to 
correspond to different light curves as much as possible. To ob- 
tain this, one can first transform the Fourier coefficient into a set 
of amplitudes Ay and phases PH/j as follows: 



V/ = a/4 



+ M 



PHjj - arctan(£>y, ay), 



(3) 

(4) 



with the arctangent function returning phase angles in the in- 
terval ]— n,+ri]. This provides us with a completely equivalent 
description of the light curve: 



y(t) = 2^L Ai ' sia i2nfijt + PHi ^ + b °- 

i-l 7=1 



(5) 



The positive amplitudes Ay are already time-translation invari- 
ant, but the phases PH/j are not. This invariance can be ob- 
tained for the phases as well, by putting PH\ \ equal to zero and 
changing the other phases accordingly (equivalent to a suitable 
time-translation, depending on the zero-point of the light curve). 
Although arbitrary, it is preferable to choose PH\ \ as the refer- 
ence, since this is the phase of the most significant component in 
the light curve. The new phases now become: 



PH' 



f jfu 



arctan(fe y , ay) - (— )arctan(fcn,an), 



(6) 



with PH' n = 0. The factor (jfi/fi) in this expression is the ra- 
tio of the frequency of the y'-th harmonic of f to the frequency 
/i, because the first harmonic of /i has been chosen as the ref- 
erence. Note that these new phases can have values between 
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Table 2. Stellar variability classes considered in this study, their code, the number of light curves we used (NLC) and their source. 
Also listed (when relevant for the class) are the ranges for the parameters r er y and log g if they could be determined from the 
literature. The last two columns list the range for the dominant frequencies (/i) and their amplitudes (An) present in the light 
curves, resulting from our analysis (Sect. 2.1). 



Class 


NLC 


Instrument 


Range T cl , 


Range log g 


Range jfi (c/d) 


Range An (mag) 


Periodically variable supergiants (PVSG) 


76 


HIPPARCOS/GENEVA/ESO 


3890 - 56234K 


1.0-4.5 


0.0004-14.1668 


0.0027 - 0.4689 


Pulsating Be-stars (BE) 


57 


HIPPARCOS/GENEVA 


17100 -23850K 


3.30-4.33 


0.0003 - 14.0196 


0.0023 - 2.9385 


/3-Cephei stars (BCEP) 


58 


HIPPARCOS/GENEVA 


18238 -36813K 


3.18-4.30 


0.0180-11.3618 


0.0030-0.1344 


Classical Cepheids (CLCEP) 


195 


HIPPARCOS/GENEVA 


4800 - 6648K 


1.45-2.6 


0.0222 - 0.4954 


0.0493-1.1895 


Beat (double-mode)-Cepheids (DMCEP) 


95 


OGLE 


5000 - 7000K 


2-3.5 


0.5836-1.7756 


0.0544-0.1878 


Population II Cepheids (PTCEP) 


24 


HIPPARCOS 


5200 - 6550K 




0.0038- 1.5377 


0.1561-0.6364 


Chemically peculiar stars (CP) 


63 


HIPPARCOS/GENEVA 


6500-18900K 


3.2-4.6 


0.0076-33.4158 


0.0027 - 0.0604 


5-Scuti stars (DSCUT) 


139 


HIPPARCOS/GENEVA 


6550-9126K 


3.5-4.25 


0.0109-26.9967 


0.0043 - 0.3841 


/1-Bootis stars (LBOO) 


13 


HIPPARCOS 


6637 - 9290K 


3.4-4.1 


7.0865 - 14.5035 


0.0036 - 0.0143 


SX-Phe stars (SXPHE) 


7 


HIPPARCOS/GENEVA 


6940 - 8690K 


3.34-4.3 


6.2281 - 16.2625 


0.0138-0.3373 


y-Doradus stars (GDOR) 


35 


HIPPARCOS/GENEVA 


5980 - 7375K 


3.32-4.58 


0.3803 - 13.7933 


0.0048 - 0.0325 


Luminous Blue Variables (LBV) 


21 


HIPPARCOS/GENEVA/ESO 


8000 - 30000K 




0.0004 - 2.0036 


0.0296 - 0.9877 


Mira stars (MIRA) 


144 


HIPPARCOS 


2500 - 3500K 




0.0020 - 0.6630 


0.2828-3.9132 


Semi-Regular stars (SR) 


42 


HIPPARCOS 


2500 - 3500K 




0.0012-11.2496 


0.0216-1.9163 


RR-Lyrae, type RRab (RRAB) 


129 


HIPPARCOS/GENEVA 


6100-7400K 


2.5-3.0 


1.2150-9.6197 


0.0745 - 0.5507 


RR-Lyrae, type RRc (RRC) 


29 


HIPPARCOS 






2.2289-4.5177 


0.0313-0.2983 


RR-Lyrae, type RRd (RRD) 


57 


OGLE 






2.0397-2.8177 


0.0899 - 0.2173 


RV-Tauri stars (RVTAU) 


13 


HIPPARCOS/GENEVA 


4250 - 7300K 


-0.9 - 2.0 


0.0011-1.0280 


0.2851-2.3831 


Slowly-pulsating B stars (SPB) 


47 


HIPPARCOS/GENEVA/MOST 


12000 -18450K 


3.8-4.4 


0.1394-3.7625 


0.0036 - 0.0982 


Solar-like oscillations in red giants (SLR) 


1 


MOST 






0.0352 


0.0014 


Pulsating subdwarf B stars (SDBV) 


16 


ULTRACAM 


23000 - 32000K 


4.5-5.6 


242.5726-612.7225 


0.0038 - 0.0739 


Pulsating DA white dwarfs (DAV) 


2 


WET 


10350- 11850K 


7.73-8.74 


149.2038-401.5197 


0.0020 - 0.0226 


Pulsating DB white dwarfs (DBV) 


1 


WET/CFHT 


11000-30000K 


~8 


150.5844 


0.0401 


GW-Virginis stars (GWVIR) 


2 


CFHT 


70000 - 170000K 




192.9965 - 215.3986 


0.0141-0.0216 


Rapidly oscillating Ap stars (ROAP) 


4 


WET/ESO 


6800 - 8400K 


3.77-4.52 


123.0299 - 235.0878 


0.0013-0.0022 


T-Tauri stars (TTAU) 


17 


HIPPARCOS/GENEVA 


3660 - 4920K 


3.8-4.5 


0.0009-11.0231 


0.0092 - 0.8925 


Herbig-Ae/Be stars (HAEBE) 


21 


HIPPARCOS/GENEVA 


5900-16000K 


3.5-5 


0.0009-10.9516 


0.0053 - 0.8925 


FU-Ori stars (FUORI) 


3 


ROTOR 


13000 - 15000K 




0.0002 - 0.0006 


0.0432-0.2181 


Wolf-Rayet stars (WR) 


63 


HIPPARCOS/GENEVA/ESO/MOST 


14800 -91000K 




0.0003 - 15.9092 


0.0016-0.3546 


X-Ray binaries (XB) 


9 


HIPPARCOS/GENEVA 






0.0057-11.2272 


0.0063-0.0813 


Cataclysmic variables (CV) 


3 


ULTRACAM 






27.5243 - 36.9521 


0.1838-0.5540 


Eclipsing binary, type EA (EA) 


169 


OGLE 






0.0127-3.1006 


0.0371-0.2621 


Eclipsing binary, type EB (EB) 


147 


OGLE 






0.0175-4.5895 


0.0454 - 0.7074 


Eclipsing binary, type EW (EW) 


59 


OGLE 






0.2232-8.3018 


0.0376 - 0.4002 


Ellipsoidal binaries (ELL) 


16 


HIPPARCOS/GENEVA 






0.1071-3.5003 


0.0136 - 0.0629 



-oo and +oo. We can now constrain the values to the interval 
] - n, +n], since all phases differing with an integer multiple of 
2n are equivalent. This can be done using the same arctangent 
function: 



PH'/j = arctan(sin(P#;p, cos(PH' u )). 



(7) 



The parameters Ay and PH"- now provide us with a time- 
translation invariant description of the light curves and are suit- 
able for classification purposes. Note that this translation invari- 
ance strictly only holds for monoperiodic light curves, and is 
not valid for multiperiodic light curves. Alternate transforma- 
tions are being investigated to extend the translation invariance 
to multiperiodic light curves as well. For ease of notation, we 
drop the apostrophes when referring to the phases PH".. 

Another important parameter, which is also calculated dur- 
ing the fitting procedure, is the ratio of the variances v/i /v in 
the light curve, after and before subtraction of a harmonic fit 
with only the frequency f\. This parameter is very useful for 
discriminating between multi- and monoperiodic pulsators. Its 
value is much smaller for monoperiodic pulsators, where most 
of the variance in the light curve can be explained with a har- 
monic fit with only f\ . 

In total, we calculated 28 parameters starting from the origi- 
nal time series: the slope b of the linear trend, 3 frequencies, 12 
amplitudes, 1 1 phases {PH\\ is always zero and can be dropped) 
and 1 variance ratio. This way, the original time series, which can 
vary in length and number of measurements, were transformed 
into an equal number of descriptive parameters for every star. 

We calculated the same parameter set for each star, irrespec- 
tive of the variability class they belong to. This set provided us 



with an overall good description of the light curves for pulsat- 
ing stars, and even did well for eclipsing binaries. It is clear, 
however, that the whole parameter set might not be needed for 
distinguishing, say, between class A and class B. The distinc- 
tion between a Classical Cepheid and a Mira star is easily made 
with only the parameters f\ and A\\, other parameters are thus 
not necessary and might even be completely irrelevant for this 
example. For other classes, we have to use more parameters to 
reach a clear distinction. 

With these 28 selected parameters, we found a good com- 
promise between maximum separability of all the classes and 
a minimum number of descriptive parameters. Our class defini- 
tions are based on the entire parameter set described above. A 
more detailed study on statistical attribute selection methods is 
presented in Sect. 3.2.1, as this is closely related to the perfor- 
mance of a classifier. 

2.2. Stellar variability classes in parameter space 

The different variability classes can now be represented as sets 
of points in multi-dimensional parameter space. Each point in 
every set corresponds to the light curve parameters of one of 
the class' member stars. The more the clouds are separated from 
each other, the better the classes are defined, and the fewer the 
misclassifications which will occur in the case of a supervised 
classification, using these class definitions. As an external check 
for the quality of our class definitions, we performed a visual in- 
spection of phase plots made with only f\, for the complete set. 
If these were of dubious quality (or the wrong variability type), 
the objects were deleted from the class definition. It turned out 
to be very important to retain only definition stars with high- 
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quality light curves. This quality is much more important than 
the number of stars to define the class, provided that enough stars 
are available for a good sampling of the class' typical parame- 
ter ranges. Visualizing the classes in multi-dimensional space 
is difficult. Therefore we plot only one parameter at a time for 
every class. Figures [2] [5] lolfTOl show the spread of the derived 
light curve parameters for all the classes considered. Because 
the range can be quite large for frequencies and amplitudes, we 
have plotted the logarithm of the values (base 10 for the frequen- 
cies and base 2 for the amplitudes). As can be seen from Fig. [2] 
using only fa and An, we already attain a good distinction be- 
tween monoperiodically pulsating stars such as Miras, RR-Lyrae 
and Cepheids. For the multiperiodic pulsators, a lot of overlap 
is present and more parameters (fa, fa, the A2j and the Ay) are 
needed to distinguish between those classes. If we look at the fre- 
quencies and amplitudes, we see that clustering is less apparent 
for the non-periodic variables such as Wolf-Rayet stars, T-Tauri 
stars and Herbig Ae/Be stars. For some of those classes, we only 
have a small number of light curves, i.e. we do not have a good 
'sampling' of the distribution (selection effect). The main reason 
for their broad distribution is, however, the frequency spectrum: 
for the non-periodic variables, the periodogram will show a lot of 
peaks over a large frequency range, and selecting three of them 
is not adequate for describing the light curve. Selecting more 
than three, however, entails the danger of picking non-significant 
peaks. The phase values PHu corresponding to the harmonics of 
fa cluster especially well for the eclipsing binary classes, as can 
be expected from the nature of their light curves. These param- 
eters are valuable for separating eclipsing binaries from other 
variables. The phase values for the harmonics of fa and fa do not 
show significant clustering structure. On the contrary, they tend 
to be rather uniformly distributed for every class and thus, they 
will likely not constitute very informative attributes for classifi- 
cation. This is not surprising, since these phases belong to less 
significant signal components and will vary more randomly for 
the majority of the stars in our training set. In the next section, 
we discuss more precise methods for assessing the separation 
and overlap of the classes in parameter space. 

Complementary to these plots, we have conducted a more 
detailed analysis of the statistical properties of the training set. 
This analysis is of importance for a sensible interpretation of 
the class assignments obtained for unknown objects, since the 
class boundaries of the classifiers depend critically on the den- 
sities of examples of each class as functions of the classification 
parameters. This analysis comprises i) the computation of box- 
and-whiskers plots for all the attributes used in classification (see 
Fig-El El and[T2lfor example); ii) the search for correlations be- 
tween the different parameters; iii) the computation of Id, 2d and 
3d nonparametric density estimates (see Fig.|4]for an easily in- 
terpretable hexagonal histogram); iv) clustering analysis of each 
class separately and for the complete training set. The results of 
this analysis are especially useful for guiding the extension of 
the training set as new examples become available to users, such 
as those from CoRoT and Gaia. 



3. Supervised Classification 

The class descriptions we attained, form the basis of the so- 
called 'Supervised Classification' . This classification method as- 
signs every new object to one of a set of pre-defined classes 
(called 'training classes'), meaning that, given the time series 
characteristics described above, the system gives a set of proba- 
bilities that the source of the time series belongs to one of the set 
of classes listed in TableQ] 



A supervised classification can be done in many ways. The 
most suitable method depends on the kind of data to be classi- 
fied, the required performance and the available computational 
power. We focus here on a statistical method based on multi- 
variate analysis, also known as the 'Gaussian Mixture Model'. 
We have chosen for a fast and easily adaptable code written 
in FORTRAN. We also summarize the results of a detailed study 
of other supervised classification methods, based on Artificial 
Intelligence techniques. 

3.1. Multivariate Gaussian mixture classifier 

We assume that the descriptive parameters for every class have 
a multinomial distribution. This is a reasonable assumption for 
a first approach. There is no reason to assume a more compli- 
cated distribution function, unless there is clear evidence. The 
added advantages of the multinormal distribution are the well- 
known properties and the relatively simple calculations. We use 
our derived light curve parameters to estimate the mean and the 
variance of the multinormal distributions. If the vector Xij repre- 
sents the parameters of light curve number j belonging to class i, 
the following quantities are calculated for every variability class. 
The class mean vector of length P (number of light curve param- 
eters, P — 28 in our method for example): 



— 1 






and the class variance-covariance matrix of dimension PxP: 
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Every class is now defined by a mean vector X, and a variance- 
covariance matrix 5,, which corresponds to the mean and the 
variance of a normal distribution in the one-dimensional case. 

If we want to classify a new object, we first have to calculate 
the same light curve parameters as described in Sect. 2. We can 
then derive the statistical distance of this object with respect to 
the different classes, and assign the object to the nearest (most 
probable) class. If X denotes the parameters for the new object, 
we calculate the following statistical distance for every class: 



D i = (X-X i )'S- 1 (X-X i ) + }n\S i \ 1 



(10) 



with \Sj \ the determinant of the variance-covariance matrix (e.g. 
Sharma 1996). The first term of A is known as the squared 
Mahalanobis distance. The object is now assigned to class i for 
which Dj is minimal. This minimum of D, is equivalent to the 
maximum of the corresponding density function (under the as- 
sumption of a multinormal distribution): 



MX) = 



(2n) p l 2 \Si\ 



1/2 



esp~(X-Xd'Sr\x-Xd. (ID 



This statistical class assignment method can cause an object to 
be assigned to a certain class even if its light curve parameters 
deviate from the class' typical parameter values. This is a draw- 
back, and can cause contamination in the classification results. 
It does, however, has an important advantage: objects near the 
border of the class can still be correctly assigned to the class. 
If one is only interested in objects that are very similar to the 
objects used to define the class, one can define a cutoff value 
for the Mahalanobis distance. Objects that are too far from the 
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Fig. 2. The range for the frequency f t (in cycles/day), its first harmonic amplitude A n (in magnitude), the phases PH \i (in radians) 
and the variance ratio v/i /v (varrat) for all the 35 considered variability classes listed in TableQ] For visibility reasons, we have 
plotted the logarithm of the frequency and amplitude values. Every symbol in the plots corresponds to the parameter value of 
exactly one light curve. In this way, we attempt to visualize the distribution of the light curve parameters, in addition to their mere 
range. 
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Fig. 3. Box-and-whiskers plot of the logarithm of f\ for 29 classes with sufficient members to define such tools in the training set. 
Central boxes represent the median and interquantile ranges (25 to 75%) and the outer whiskers represent rule-of-thumb boundaries 
for the definition of outliers (1.5 the quartile range). The box widths are proportional to the number of examples in the class. 
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Fig. 4. Hexagonal representation of the two dimensional density of examples of the Classical Cepheids class in the log(P)-log(/?2i) 
space. The quantity R21 = A^/A g represents the ratio of the second to the first harmonic amplitude of /r. This plot is comparable 
to Fig. 3 of lUdalski et alJ (Il999bh . 



class centers will then not be assigned to any of the classes. To 
illustrate this, consider a classifier where only /j would be used 
as a classification attribute, and suppose we are interested in /?- 
Cephei stars. If we do not want a star to be classified as /3-Cephei 
if the value of f\ is larger than 15 c/d, we have to take a cutoff 
value for the Mahalanobis distance equal to 4 in frequency space 
(this value only holds for our definition of the /?-Cephei class). 
In terms of probabilities: objects with a Mahalanobis distance 
larger than 4 are more than Act away from the class center (the 



class' mean value for /!), and are therefore very unlikely to be- 
long to the class. 

We emphasize the difference between a supervised classifi- 
cation method as defined here and an extraction method: a su- 
pervised classifier assigns new objects to one of a set of defini- 
tion classes with a certain probability, given the object's derived 
parameters. An extractor, on the other hand, will only select 
those objects in the database for which the derived parameters 
fall within a certain range. Extractor methods are typically used 
by scientists only interested in one class of objects. The speci- 
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fied parameter range for an extractor (based on the knowledge 
of the variability type) can be chosen as to minimize the number 
of contaminating objects, to make sure that the majority of the 
selected objects will indeed be of the correct type. Of course, 
extraction methods can also be applied to our derived parameter 
set. The goal of our supervised classifier is much broader, how- 
ever: we consider all the known variability classes at once and 
also get a better view on the differences and similarities between 
the classes. Moreover, our method does not need visual inspec- 
tion of the light curves, while this was always needed in practice 
with extraction. On top of this, our class definitions can also be 
used to specify parameter ranges for extraction methods. 

3.2. Machine learning classifiers 

Following standard practice in the field of pattern recognition or 
statistical learning, we have adopted a parallel approach where 
we allow for more flexibility in the definition of the models used 
to classify the data. The Gaussian mixtures model presented in 
the previous section induces hyperquadratic boundaries between 
classes (with hyperspheres/hyperellipses as special cases). This 
has the advantage of providing a fast method for the detection 
of outliers (objects at large Mahalanobis distances from all cen- 
ters) and easy interpretation of results. On the other hand, more 
sophisticated methods offer the flexibility to reproduce more 
complicated boundaries between classes, at the expense of more 
complex models with varying degrees of interpretability. 

A common problem presented in the development of su- 
pervised classification applications based on statistical learning 
methods, is the search for the optimal trade-off between the two 
components of the classifier error. In general, this error is com- 
posed of two elements: the bias and the variance. The former is 
due to the inability of our models to reproduce the real decision 
boundaries between classes. To illustrate this kind of error, we 
can imagine a set of training examples such that any point above 
the y = sin(jc) curve belongs to class A and any point below it, to 
class B. Here, classes A and B are separable (unless we add noise 
to the class assignment), and the decision boundary is precisely 
the curve y = sin(x). Obviously, if we try to solve this toy clas- 
sification problem with a classifier inducing linear boundaries 
we will inevitably have a large bias component to the total error. 
The second component (the variance) is due to the finite nature 
of the training set and the fact that it is only one realization of 
the random process of drawing samples from the true (but un- 
known) probability density of having an object at a given point 
in the hyperspace of attributes. 

If the model used to separate the classes in the classification 
problem is parametric, then we can always reduce the bias term 
by adding more and more degrees of freedom. In the Gaussian 
mixtures case, where we model the probability densities with 
multivariate Gaussians, this would be equivalent to describing 
each class by the sum of several components (i.e. several multi- 
variate Gaussians). It has to be kept in mind, however, that there 
is an optimal number of components beyond which the decrease 
in the bias term is more than offset by an increase of the variance, 
due to the data being overfitted by the complexity of the model. 
The natural consequence is the loss of generalization capacities 
of the classifier, where the generalization ability is understood as 
the capacity of the model to correctly predict the class of unseen 
examples based on the inferences drawn from the training set. 

We computed models allowing for more complex decision 
boundaries where the bias-variance trade-off is sought, using 
standard procedures. Here we present brief outlines of the meth- 
ods and a summary of the results, while a more detailed analysis 



will be published in a forthcoming paper (Sarro et al., in prepara- 
tion). We made use of what is widely known as Feature Selection 
Methods. These methods can be of several types and are used to 
counteract the pernicious effect of irrelevant and/or correlated 
attributes for the performance of classifiers. The robustness of a 
classifier to the degradation produced by irrelevance and correla- 
tion depends on the theoretical grounds on which the learning al- 
gorithms are based. Thus, detailed studies have to be conducted 
to find the optimal subset of attributes for a given problem. The 
interested reader can find a good introduction to the field and ref- 
erence s to the methods used in this paper in iGuvon & Elisseefn 
(120031) . 

We adopted two strategies: training a unique classifier for 
the 29 classes with sufficient stars for a reliable estimate of the 
errors, or adopting a multistage approach where several large 
groups with vast numbers of examples and well identified sub- 
groups (eclipsing binaries, Cepheids, RR-Lyrae and Long Period 
Variables) are classified first by specialized modules in a sequen- 
tial approach and then, objects not belonging to any of these 
classes are passed to a final classifier of reduced complexity. 

3.2.1. Feature selection 

Feature selection methods fall into one of two categories: filter 
and wrapper methods. Filter methods rank the attributes (or sub- 
sets of them) based on some criterion independent of the model 
to be implemented for classification (e.g., the mutual informa- 
tion between the attribute and the class or between attributes, or 
the statistical correlation between them). Wrapper methods, on 
the other hand, explore the space of possible attribute subsets 
and score each combination according to some assessment of 
the performance of the classifier trained only on the attributes in- 
cluded in the subset. The exhaustive search for an optimal subset 
in the space of all possible combinations rapidly becomes unfea- 
sible as the total number of attributes in the original set increases. 
Therefore, some sort of heuristic search, based on expected prop- 
erties of the problem, has to be used in order to accomplish the 
selection stage in reasonable times. 

We applied several filtering techniques based on Information 
Theory (Information Gain, Gain Ratio and symmetrical uncer- 
tainty) and statistical correlations to the set of attributes de- 
scribed in Sect. 2. 1, extended with peak-to-peak amplitudes, har- 
monic amplitude ratios (within and across frequencies) and fre- 
quency ratios. These techniques were combined with appropriate 
search heuristics in the space of feature subsets. Furthermore, 
we also investigated feature relevance by means of wrapper 
techniques applied to Bayesian networks and decision trees, 
but not to the Bayesian combination of neural networks or to 
Support Vector Machines due to the excessive computational 
cost of combining the search for the optimal feature subset and 
the search for the classifier's optimal set of parameters. The 
Bayesian model averaging of neural networks in the implemen- 
tation used here, incorporates automatic relevance determination 
by means of hyperparameters. For this reason, we have not per- 
formed any feature selection. 

In general, there is no well-founded way to combine the re- 
sults of these methods. Each approach conveys a different per- 
spective of the data and it is only by careful analysis of the rank- 
ings and selected subsets that particular choices can be made. 
As a general rule, we have combined the rankings of the dif- 
ferent methodologies when dealing with single stage classifiers, 
whereas for sequential classifiers, each stage had its own feature 
selection process. When feasible in terms of computation time 
(e.g. for Bayesian networks), the attribute subsets were scored 
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in the wrapper approach. Otherwise, several filter methods were 
applied and the best results used. 

3.2.2. Bayesian networks classifier 

Bayesian networks are probabilistic graphical models where the 
uncertainty inherent to an expert system is encoded into two 
basic structures: a graphical structure S representing the con- 
ditional independence relations amongst the differ ent attributes , 
and a joint probability distribution for its nodes (Pearl 1988). 
The nodes of the graph represent the variables (attributes) used 
to describe the examples (instances). There is one special node 
corresponding to the class attribute. Here, we have constructed 
models of the family known as fc-dependent Bayesian classifier 
(Sahami 1996) with k, the maximum number of parents allowed 
for a node in the graph, set to a maximum of 3 (it was checked 
that higher degrees of dependency did not produce improve- 
ments in the classifier performance). 

The induction of Bayesian classifiers implies finding an op- 
timal structure and probability distribution according to it. We 
have opted for a score and search approach, where the score is 
based on the marginal li kelihood of the structu r e as im plemented 
in the K2 algorithm bv lCooper & Herskovitsl d!992l) . Although 
there are implementations of the ^-dependent Bayesian classi- 
fiers for continuous variables, also known as Gaussian networks, 
we have obtained significantly better results with discretized 
variables. The discretization process is based o n the Minimum 
Descri ption Length principle as proposed in iFavvad & Iranil 
(1993]). It is carried out as part of the cross validation experi- 
ments to avoid overfitting the training set. 



3.2.3. Bayesian average of artificial neural networks 
classifier 

Artificial neural networks are probably the most popular 
methodology for classification and clustering. They are taken 
from the world of Artificial Intelligence. In its most frequent im- 
plementation, it is defined as a feedforward network made up of 
several layers of interconnected units or neurons. With appro- 
priate choices for the computations carried out by the ne urons , 
we have the well-known multilayer perceptron. Bis hop] d 1995b 
has written an excellent introductory text to the world of neural 
networks, statistical learning and pattern recognition. 

We do not deviate here from this widely accepted archi- 
tecture but use a training approach other than the popular er- 
ror backpropagation algorithm. Instead of the maximum likeli- 
hood estimate provided by it, we use Bayesian Model Averaging 
(BMA). BMA combines the predictions of several models (net- 
works) weighting each by the a posterior probability of its 
parameters (the weights of network synapses) given the train- 
ing data. Fo r a more in-depth de scription of the methods, see 
iNeall d 1996b or lSarroetaH d2006l) . In the following, we use the 
acronym BAANN to refer to the averaging of artificial neural 
networks. 



3.2.4. Support vector machines classifier 

Support vector machines (SVM) are base d on the minimiza- 
tion of the structural risk dGunn et al.l ll997). The structural risk 
can be proven to be upper-bounded by the sum of the empiri- 
cal risk and the optimism, a quantity dependent on the Vapnik- 
Chervonenkis dimension of the chosen set of classifier func- 
tions. For linear discriminant functions, minimizing the opti- 



mism amounts to finding the hyperplane separating the training 
data with the largest margin (distance to the closest examples 
called support vectors). For nonlinearly separable problems, the 
input space can be mapped into a higher dimensional space us- 
ing kernels, in the hope that the examples in the new hyperspace 
are linearly separable. A goo d presentatio n of the foundations of 
the method can be found in Vapnik ( 1995). Common choices for 
the kernels are n-th degree polynomial and Gaussian radial basis 
functions. The method can easily incorporate noisy boundaries 
by introducing regularization terms. We used radial basis func- 
tions kernels. The parameters of the method (the complexity or 
regularization parameter and the kernel scale) are optimized by 
grid search and 10-fold cross validation. 

4. Classifier performance 

One of the central problems of statistical learning from samples 
is that of estimating the expected error of the developed clas- 
sifiers. The final goal of automatic classification, as mentioned 
above, is to facilitate the analysis of large amounts of data which 
would otherwise be left unexplored because the amount of time 
needed for humans to undertake such an analysis is incommen- 
surably large. This necessity cannot mask the fact that classifiers 
have errors and these need to be quantified if scientific hypothe- 
ses are to be drawn from their products. 

When developing a classifier, the goal is to maximize the 
number of correct classifications of new cases. Given the classi- 
fication method, the performance of a supervised classification 
depends, amongst other factors that measure the faithfulness of 
the representation of the real probability densities by the training 
set, on the quality of the descriptive parameters used for classi- 
fying. We seek a set of classification attributes which describes 
most light curves well and provides a good separation of the 
classes in attribute space. 

Several methodologies exist to evaluate classifiers. A com- 
mon way of testing a classifier's performance is feeding it with 
objects with a known member class and derive how many of 
them are correctly classified. This method is called 'cross vali- 
dation' in the case that the complete training set is split up into 
two disjointed sets: a training set and a set that will be classified, 
called the validation set. It is also possible to use the complete 
set for both training and classifying. This is known as 'resam- 
pling' . This is no longer a cross validation experiment, since the 
objects used for training and for classifying are the same. For 
a real cross validation experiment, the objects to classify must 
be different from the objects in the training set, in order to have 
statistical independence. The resampling method thus has a bias 
towards optimistic assessment of the misclassificationrate, com- 
pared to a cross validation method. Another possibility (called 
holdout procedure) consists of training the classifier with a sub- 
set of the set of examples and evaluating its error rates with the 
remainder. Depending on the percentage split it can be biased as 
well, but this time in the opposite (pessimistic) direction. Finally, 
the most common approach to validating classification models 
is called £-fold cross validation. This consists of dividing the set 
of examples into k folds, repeating k times the process of train- 
ing the model with k - 1 folds and evaluating it with the k-th 
fold not used for training. Several improvements to this method 
can be implemented to reduce the variance of its estimates, e.g. 
by assuring a proportional representation of classes in the folds 
(stratified cross validation). Recent proposals include bolstered 
resubstitution and several variants. Good and recent overviews 
of the p roblem with ref eren ces to relevant b ibliography can be 
found in lDemsarld2006l) and lBouckaertl d2004l) . 
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4. 1 . Gaussian mixture classifier 

For the simplest classifier, we also considered the simplest per- 
formance test by adopting the resampling approach. Using this 
method, we already get an idea of the overlap and separability 
of the classes in parameter space. 

The total number of correct classifications expressed as 
a percentage, can be rather misleading. For example, if our 
training set contains many example light curves for the well- 
separated classes, we will have a high rate of correct classifica- 
tions, even if the classifier performs very badly for some classes 
with only a small number of training objects. Therefore, it is bet- 
ter to judge the classifier's performance by looking at the 'con- 
fusion matrix'. This is a square matrix with rows and columns 
having a class label. It lists the numbers of objects assigned to 
every class in the training set after cross validation. The diago- 
nal elements represent the correct classifications, and their sum 
(trace of the matrix) divided by the total number of objects in the 
training set, equals the total correct classification rate. The off- 
diagonal elements show the number of misclassified (confused) 
objects and the classes they were assigned to. In this way, we 
get a clear view on the classifier's performance for every class 
separately. We can see which classes show high misclassifica- 
tion rates and are thus not very well separated using our set of 
classification attributes. 

Table[3] shows the confusion matrix for a subset of 25 vari- 
ability classes. These are the classes with more than 13 member 
stars each. We have chosen not to take the classes with fewer 
member stars into account here, because the number of clas- 
sification attributes is limited by the number of member stars 
in the class. This is merely a numerical limitation of the mul- 
tivariate Gaussian mixture classifier: if the number of defining 
class members is equal to or lower than the number of clas- 
sification attributes, the determinant of the variance-covariance 
matrix will become equal to zero. This makes it impossible to 
calculate the statistical distance with respect to the class us- 
ing Eq. ( [Tol l. We used 12 light curve parameters to perform the 
classification (the smallest class in this set contains 13 mem- 
ber stars): the three frequencies ft, the four amplitudes A\j, the 
phases PH\j, the linear trend b and the variance ratio. The aver- 
age correct classification rate is about 69% for this experiment. 
As can be seen from the matrix in Table|3] the monoperiodic pul- 
sators such as MIRA, CLCEP, DMCEP, RRAB, RRC and RRD 
are well separated. Some of the multiperiodic pulsators are also 
well identified (SPB, GDOR). A lot of misclassifications (fewer 
than 50% correct classifications) occur for the following mul- 
tiperiodic pulsators: BE, PVSG, DSCUT Also, some of the ir- 
regular and semi-regular variables show poor results (SR, WR, 
HAEBE, TTAU) as was emphasized in Sect. 2.2. 

Depending on the intended goal, it can be better to take fewer 
classes into account. For example, when the interest is focused 
on a few classes only, using fewer classes will decrease the risk 
of misclassifying members of those classes. To illustrate this, 
Table|4]shows the confusion matrix for only 14 classes using the 
complete set of 28 light curve parameters defined in Sect. 2. 1 to 
perform the classification. We did not include the classes with 
very irregular light curves or the less well-defined classes such 
as BE, CP, WR and PVSG. 

The average correct classification rate amounts to 92% for 
this experiment. It is clear that the monoperiodic ally pulsating 
stars are again very well separated (MIRA, CLCEP, DMCEP, 
RRAB, RRC and RRD). Most of the classes with multiperiodic 
variables also show high correct classification rates now (SPB, 
GDOR). Confusion is still present for the DSCUT and the BCEP 



classes. This is normal, as these stars have non-radial oscilla- 
tions with similar amplitudes and with overlap in frequencies. 
For those classes in particular, additional (or different) attributes 
are necessary to distinguish them, e.g. the use of a color in- 
dex as we will discuss extensively in our future application of 
the methods to the OGLE database. Parameter overlap (similar- 
ity) with other classes is the main reason for misclassifications 
if only light curves in a single photometric band are available. 
Note the high correct classification rate for the three classes of 
eclipsing binaries (EA, EB and EW). Some of their light curves 
(mainly from the EA class) are highly non-sinusoidal, but they 
are nevertheless well described with our set of attributes. 

The higher correct classification rates for this classification 
experiment with 14 classes is caused by the removal of the most 
'confusing' classes compared to the experiment with 25 classes, 
and the increased number of discriminating attributes (this was 
tested separately). Note that the use of fewer classes for classi- 
fying also implies more contamination of objects which actually 
belong to none of the classes in the training set. This can effec- 
tively be solved by imposing limits on the Mahalanobis distance 
to the class centers. Objects with a Mahalanobis distance larger 
than a certain user-defined value, will then not be assigned to 
any class. 

4.2. Machine learning techniques 

Selecting a methodology amongst several possible choices is in 
itself a statistical problem. Here we only summarize the results 
of a complete study comparing the approaches listed in Sect. 3.2, 
the details of which will be published in a specialized journal in 
the area of Pattern Recognition. 

As explained in Sect. 3.2, one reason models can fail is that 
they are not flexible enough to describe the decision boundaries 
required by the data (the bias error). Another reason is because 
the training set, due to its finite number of samples, is never 
a perfect representation of the real probability densities (other- 
wise one would work directly with them and not with examples). 
Since learning algorithms are inevitably bound to use the train- 
ing set to construct the model, any deficiency or lack of faith- 
fulness in their representation of the probability densities will 
translate into errors. The bias-variance trade-off explained above 
is somehow a way to prevent the learning algorithm from adjust- 
ing itself too tightly to the training data (a problem known as 
overfitting) because its ability to generalize depends critically 
on it. Finally, irrespective of all of the above, we cannot avoid 
dealing with confusion regions, i.e., regions of parameter space 
where different classes coexist. 

For the machine learning technique, we selected the combi- 
nation of 10 sorted runs of 10-fold cross validation experiments 
together with the standard t-test (Demsar 2006). This combina- 
tion assures small bias, a reduced variance (due to the repetition 
of the cross validation experiments) and replicability, an issue 
of special importance since these analyses will be iterated as 
the training set will be completed with new instances for the 
poorly represented classes and new attributes from projects such 
as CoRoT, Kepler and Gaia. 

In the following, we have split the results for single stage and 
sequential classifiers. It should be born in mind that the misclas- 
sification rates used in the following sections include entries in 
the confusion matrices which relate eclipsing binary subtypes. 
These are amongst the largest contributions to the overall mis- 
classificatio n rate and are due to a poor definition of the subtypes 
as argued in lSarro et al.l d2006l) and as is widely known. In future 
applications of the classifiers (i.e. for CoRoT data) the special- 



Table 3. The Confusion Matrix for the Gaussian Mixture method, using 25 variability classes and 12 classification attributes. The last but one line lists the total number of light 
curves (TOT) to define every class. The last line lists the correct classification rate (CC) for every class separately. The average correct classification rate is about 69%. 
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Table 4. The confusion matrix for the Gaussian mixture method using 14 variability classes and 28 classification attributes. The last 
but one line lists the total number of light curves (TOT) to define every class. The last line lists the correct classification rate (CC) 
for every class separately. The average correct classification rate is about 92%. 
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ized classifier presented in ISarro et al.l (120061) and its classifica- 
tion scheme will be used. Therefore, the misclassification rates 
quoted below are too pessimistic by an estimated 2%. 

4.2.1 . Single stage classifiers 

Table[5]shows the confusion matrix for the Bayesian model aver- 
aging of artificial neural networks. This methodology produces 
an average correct classification rate of 70%. For comparison, 
the second best single stage classifier measured by this figure is 
the 3-dependent Bayesian classifier with an overall rate of suc- 
cess of 66%. 

According to the t-test run applied to the ten sorted runs 
of 10-fold cross validation, the probability of finding this dif- 
ference under the null hypothesis is below 0.05%. However, 
this difference (equivalent to 73 more instances classified cor- 
rectly by the ensemble of neural networks) has to be put into 
the context of a more demanding computational requirement of 
the method (several hours training time in a single 2.66 GHz 
processor) compared to the almost instantaneous search for the 
Bayes ian network. For comparison, the classical C4.5 algorithm 
(Ouinlan 1993) attains only slightly worse performances (aver- 
ages of 65.2) at the expense of a more costly determination of 
the optimal parameters and greater variance with respect to the 
training sample. 

Support Vector Machines obtain much poorer results (of the 
order of 50% correct identifications). We searched the parame- 
ter space as closely as possible given the computational needs 
of a cross validation experiment with 30 classes. The best com- 
bination found is not able to compete with other approaches. It 
is always possible that we missed an island of particularly good 
performance in the grid search but the most plausible explana- 



tion for the seemingly poor results is that SVMs are not opti- 
mized for multiclass problems. These are typically dealt with by 
reducing them to many two-class problems, but most implemen- 
tations assume a common value of the parameters (complexity 
and radial basis exponent in our case) for all boundaries. 



4.2.2. Sequential classifiers 

One of the most relevant characteristics of the stellar variabil- 
ity classification problem is the rather high number of classes 
dealt with. Trying to construct a single stage classifier for such 
a large number of different classes implies a trade off between 
the optimal values of the model parameters in different regions 
of attribute space. We constructed an optimal multistage classi- 
fier in the perspective of dividing the classification problem into 
several stages, during each of which a particular subset of the 
classes is separated from the rest. 

We have selected four subgroups, one for each of the stages 
of the classifier. The choice was based on the internal similar- 
ities between instances in a group (intra cluster distances) and 
the separations between different groups. The four groups are 
eclipsing binaries (EA, EB, EW), Cepheids (CLCEP, PTCEP, 
RVTAU, DMCEP), long period variables (MIRA, SR) and RR- 
Lyrae stars (RRAB, RRC, RRD). These groups are character- 
ized by having significant entries in the confusion matrices for 
elements in each group and small contributions to these matrices 
across groups. We have trained sequential classifiers in the sense 
that the subsequent classifiers are not trained with the classes 
identified first. For example, if the first stage classifier is trained 
to separate eclipsing variables from the others, the second classi- 
fier will not have eclipsing variables in its training set. This way, 



Table 5. The confusion matrix for the Bayesian model averaging of artificial neural networks. The last but one line lists the total number of light curves (TOT) to define every 
class. The last line lists the correct classification rate (CC) for every class separately as measured by 10 fold cross validation. 
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given an instance, we can construct the complete class probabil- 
ity table as a product of conditional probabilities. 

The experiments consists of performing 10 runs of 10- 
fold cross validation for each stage with SVMs, Bayesian k- 
dependent networks and Bayesian neural network averages. The 
order in which the groups are filtered is altered in order to test the 
24 possible permutations. Each stage is preceded by a feature se- 
lection process that selects the optimal subset of features for each 
particular problem (as opposed to the single feature selection 
step of single stage classifiers). The results of the experiments 
consist of several confusion matrices of dimension 2 for each 
2 class problem, and several other confusion matrices for the 
classification of instances within these main groups. These latter 
matrices do not depend on the order of the assignment of groups 
to stages. With only one exception, all statistical tests were in- 
conclusive in the sense of not providing enough evidence for the 
rejection of the null hypothesis (having a threshold of 99.95%) 
that the classifiers have equal performance. The only exception 
is the eclipsing binaries classifier, where BAANN clearly out- 
performs all other methods. In all other cases the similarities in 
performance are remarkable. 

Table[6] shows the BAANN confusion matrices for the dif- 
ferent classification stages, while Tables|7] and [8] show the cor- 
responding matrices for the internal classification problem of 
each group and the remaining classes not assigned to any group. 
Finally, Table|9] shows the combined confusion matrix con- 
structed by multiplying conditional probabilities. For example, 
the probability of an instance being a classical Cepheid (CLCEP) 
is the probability of not being an eclipsing binary (first stage) 
times the probability of belonging to the Cepheids group (sec- 
ond stage) times the probability of being a classical Cepheid 
(specialized classifier). The average correct classification rate is 
about 66% for this classifier. 



5. Conclusions and Future Work 

We presented a uniform description of the most important stellar 
variability classes currently known. Our description is based on 
light curve parameters from well-known member stars for which 
high quality data are available. The parameters are derived us- 
ing Fourier analysis and harmonic fitting methods, and can be 
calculated on short timescales. The class descriptions obtained 
form the basis for a supervised classification method which pro- 
duces class probabilities given a set of time series attributes. It is 
shown that our class descriptions are accurate enough to separate 
the monoperiodic variables, some of the multiperiodic variables, 
and eclipsing binaries. An obvious improvement to these capa- 
bilities will come from the addition of color information to the 
class definitions. This will be discussed in a subsequent paper, 
where our methodology will be applied to the OGLE database. 

We obtained overall good classification results. The 
Gaussian mixture method is relatively simple and robust, and al- 
lows for an easy astrophysical interpretation. The machine learn- 
ing algorithms, on the other hand, achieve a lower rate of mis- 
classifications at the expense of longer training times, reduced 
interpretability and a higher level of statistical knowledge of the 
user. The following extensions/improvements are planned for the 
future: 

- Extending the number of variability classes and the number 
of member stars when more and better quality light curves 
become available, e.g. those from the exoplanet field data 
of the CoRoT mission. In particular, we will add the classes 



of stars with transits due to exoplanets, main-sequence stars 
with solar-like oscillations and stars with magnetic activity. 

- Improve the description of light curves by using methods 
other than Fourier analysis. Wavelet analysis, e.g., may be 
more suitable for describing non-periodic variables. Also, 
additional information derived from the shape of the power 
spectrum will be considered. 

- Adapt our codes with the intention to apply them to fu- 
ture large scale databases, such as those to be assembled by 
CoRoT, Kepler and Gaia. In particular, we are now preparing 
the classification of light curves which will be measured in 
the framework of the CoRoT Exoplanet programme. 

- Introduce cost matrices to generate specialized classifiers. 
When the goal of a classifier is to generate clean samples of a 
given class, it is generally preferred that the number of false 
positives be diminished even at the expense of missing some 
of the less clear candidates. In these cases, the introduction 
of cost matrices in machine learning algorithms allows for 
the differential weighting of errors in the training process, 
resulting in classifiers specialized in particular tasks. 

The results presented here are a brief summary of all the ex- 
periments and analyses that were applied to the data. For exam- 
ple, only summaries of the performance measures of some of 
the approaches taken were included. The full statistical analy- 
sis and detailed research to characterize the confusion regions of 
the classifiers will be published in a specialized journal (Sarro 
et al., in preparation). In particular, questions about the general 
properties of the subsamples of class i misclassified as j, their 
class probability distributions and if they constitute a separa- 
ble subset of class i were not discussed here, to avoid entering 
into a highly technical discussion. Given the large number of 
classes that constitute this classification problem, it is difficult 
to include the answers to these questions for the more than 400 
possible combinations of i and j but, at the same time, they are of 
paramount importance for the correct interpretation of classifier 
results. This analysis is available from the authors upon request. 
It is important to realize that the methodology presented here 
should be evaluated in a statistical sense, i.e., one can never be 
sure that each individual star is correctly and unambiguously 
classified. Our method was specifically designed for databases 
that are so large that individual inspection of all the stars is im- 
possible. Of course, such inspection can and should be done after 
a first classification with our methods has been achieved, for the 
specific class of interest to the user. We also note that our ba- 
sic methods described here assume the simplest possible input: 
single-band photometric time series. Any additional independent 
information (color indices or time series, radial velocity time se- 
ries, spectral type or logg, etc.) will imply an improved perfor- 
mance as will be shown in our application to the OGLE database 
for which such additional attributes were retrieved through the 
Virtual Observatory. We will judge the performance of the dif- 
ferent classifiers presented here in a future paper by comparing 
our classifications for the OGLE stars with published results ob- 
tained with extractor-type methods requiring human interaction. 
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Table 6. The confusion matrix for the Bayesian model averaging of artificial neural networks and the two class problem. The last 
but one line lists the total number of light curves (TOT) to define every class. The last line lists the correct classification rate (CC) 
for every class separately as measured by 10-fold cross validation. Separation between: A: eclipsing binaries (ECL) and all other 
types; B: Cepheids (CEP) and all other types; C: long period variables (LPV) and all other types except ECL and CEP; D: RR Lyrae 
stars (RR) from all other types except ECL, CEP and LPV. 
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Table 7. The confusion matrix for the Bayesian model averaging of artificial neural networks. The last but one line lists the total 
number of light curves (TOT) to define every class. The last line lists the correct classification rate (CC) for every class separately 
as measured by 10-fold cross validation. Separation between: A: Cepheids; B: long period variables; C: RR Lyrae stars. 
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Table 9. The complete confusion matrix for the Bayesian model averaging of artificial neural networks. The last but one line lists the total number of light curves (TOT) to define 
every class. The last line lists the correct classification rate (CC) for every class separately as measured by 10-fold cross validation. 
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Fig. 5. The range in amplitudes Aij for the 3 higher harmonics of /[, and the linear trend b. For visibility reasons, we have plotted 
the logarithm of the amplitude values. 
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Fig. 6. The range for the frequencies fa and fj, and the phases PH\j of the higher harmonics of f\. For visibility reasons, we have 
plotted the logarithm of the frequency values. Note the split into two clouds of the phase values PH\t, for the eclipsing binary classes. 
This is a computational artefact: phase values close to -n are equivalent to values close to +n, so the clouds actually represent a 
single cloud. 
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Fig. 7. The range in amplitudes A^j for the 4 harmonics of fa . For visibility reasons, we have plotted the logarithm of the amplitude 
values. 
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Fig. 8. The range in amplitudes A3J for the 4 harmonics of fe . For visibility reasons, we have plotted the logarithm of the amplitude 
values. 
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Fig. 9. The range in phases PHij for the 4 harmonics of /i- As can be seen from the plots, the distribution of these parameters is 
rather uniform for every class. They are unlikely to be good classification parameters, since for none of the classes, clear clustering 
of the phase values is present. 
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Fig. 10. The range in phases PHt,j for the 4 harmonics of /j. The same comments as for Fig.|9]apply here. 



J. Debosscher et al.: Automated supervised classification of variable stars, Online Material p 8 




Fig. 11. Box-and-whiskers plot of the logarithm of /?2i for all classes in the training set. Central boxes represent the median and 
interquantile ranges (25 to 75%) and the outer whiskers represent rule-of-thumb boundaries for the definition of outliers (1.5 the 
quartile range). The boxes widths are proportional to the number of examples in the class. 
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Fig. 12. Box-and-whiskers plot of the logarithm of the variance ratio v/i /v (varrat) for all classes in the training set. Central boxes 
represent the median and interquantile ranges (25 to 75%) and the outer whiskers represent rule-of-thumb boundaries for the defi- 
nition of outliers (1.5 the quartile range). The boxes widths are proportional to the number of examples in the class. 



